Normal distribution & central limit theorem

Let's assume the monthly growth rate follows the following distribution:

What is the distribution of heights from 10000 people at age 16?

This is an example that illustrates the central limit theorem, which states that the results of processes that manifest as the sum of many small, independent and identically distributed events are normally distributed.
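To make the height question above concrete, we can sketch it as a simulation. The growth-increment distribution used here (a uniform) is purely an assumption for illustration; any increment distribution with finite variance gives an approximately normal sum:

```r
# Sketch: each person's height at age 16 as the sum of many monthly
# growth increments. The uniform increment distribution is an assumption
# for illustration only.
set.seed(1)
n.people = 10000
n.months = 16 * 12                        # months from birth to age 16
monthly.growth = matrix(runif(n.people * n.months, 0.3, 0.9),
                        nrow = n.people)  # growth in cm per month
heights = 50 + rowSums(monthly.growth)    # start near 50 cm at birth
hist(heights, main = "", xlab = "height at age 16 [cm]")
```

Even though each monthly increment is uniformly distributed, the histogram of the sums is bell-shaped.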

One way to explain why this is the case is to note that there are more possible combinations of events that lead to average outcomes than combinations of events that lead to extreme outcomes.

For instance, assume that you toss a fair coin four times; each time heads shows you receive one credit point, and each time tails shows you lose one credit point. The next table shows that there are more possible sequences that lead to an end result of 0 credit points than sequences that lead to extreme results like −4 or +4 credit points.

| Permutation | event 1 | event 2 | event 3 | event 4 | sum |
|-------------|---------|---------|---------|---------|-----|
| 1           | -1      | -1      | -1      | -1      | -4  |
| 2           | 1       | -1      | -1      | -1      | -2  |
| 3           | -1      | 1       | -1      | -1      | -2  |
| 4           | 1       | 1       | -1      | -1      | 0   |
| 5           | -1      | -1      | 1       | -1      | -2  |
| 6           | 1       | -1      | 1       | -1      | 0   |
| 7           | -1      | 1       | 1       | -1      | 0   |
| 8           | 1       | 1       | 1       | -1      | 2   |
| 9           | -1      | -1      | -1      | 1       | -2  |
| 10          | 1       | -1      | -1      | 1       | 0   |
| 11          | -1      | 1       | -1      | 1       | 0   |
| 12          | 1       | 1       | -1      | 1       | 2   |
| 13          | -1      | -1      | 1       | 1       | 0   |
| 14          | 1       | -1      | 1       | 1       | 2   |
| 15          | -1      | 1       | 1       | 1       | 2   |
| 16          | 1       | 1       | 1       | 1       | 4   |
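The counts in this table follow the binomial coefficients: the number of sequences with a given end result is choose(4, number of heads). A quick check in R:

```r
# Number of 4-toss sequences for each possible number of heads (0..4);
# the end result in points is heads - tails = 2*heads - 4.
counts = choose(4, 0:4)
points = 2 * (0:4) - 4
rbind(points, counts)
# 6 of the 16 sequences give 0 points, but only 1 each gives -4 or +4.
```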

Now let's do the same experiment again, except that we are looking not at 4 but at 16 tosses, which leads to \(2^{16}\) or 65,536 possible sequences. Here is the distribution of credit points.
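A minimal sketch of this 16-toss experiment as a simulation:

```r
# Simulate 1e4 runs of 16 fair coin tosses; each head is +1, each tail -1.
set.seed(1)
credit.points = replicate(1e4,
                          sum(sample(c(-1, 1), 16, replace = TRUE)))
hist(credit.points, breaks = seq(-17, 17, by = 2),
     main = "", xlab = "credit points after 16 tosses")
# Already at 16 tosses the histogram is close to a normal distribution.
```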

One popular device to display such a process is a Galton¹ board:

Linear regression model

What is the association between length and weight at birth?

When data covary, we look at e.g. a scatter plot, which shows the joint distribution, to see how the data are related.

What is marginalization?

Sometimes, we want information about only one dimension of the data. This information is shown in the marginal distribution.

In this plot, each histogram shows the marginal distribution of length and weight, respectively. E.g., to get the frequency of length = 50 cm, we sum over all individuals with that length, regardless of their weight.
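Marginalizing a discrete joint distribution is just summing over the dimensions we are not interested in. A minimal sketch with a toy joint count table (the numbers are made up, not the birth data):

```r
# Toy joint distribution of two discrete variables
joint = matrix(c(2, 1, 3,
                 4, 6, 2), nrow = 2, byrow = TRUE,
               dimnames = list(length = c("short", "long"),
                               weight = c("low", "mid", "high")))
marginal.length = rowSums(joint)  # sum over weight
marginal.weight = colSums(joint)  # sum over length
marginal.length
# short  long
#     6    12
```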

Modeling the mean

Describing the model

To model such data, we typically think first about the likelihood function. The question here is which distribution best describes the data we observed. Given that we can think of birth weight as the result of a sum of many processes, a normal distribution, with parameters \(\mu\) and \(\sigma\) for mean and standard deviation, makes sense. (One easy way to spot parameters is that they are typically Greek letters, whereas data variables are Roman letters.)


| What       | Notation                          | quap R-code               |
|------------|-----------------------------------|---------------------------|
| Likelihood | \(w_i \sim Normal(\mu,\sigma)\)   | weight ~ dnorm(mu, sigma) |


Parameters \(\mu\) and \(\sigma\) are variables that we cannot observe directly; we have to estimate them from the data and the prior. The table above already shows how they depend on the data, but we still need to add that they also depend on the prior:


| What       | Notation                          | quap R-code               |
|------------|-----------------------------------|---------------------------|
| Likelihood | \(w_i \sim Normal(\mu,\sigma)\)   | weight ~ dnorm(mu, sigma) |
| Prior      | \(\mu \sim Normal(3.5,1.5)\)      | mu ~ dnorm(3.5, 1.5)      |
| Prior      | \(\sigma \sim Uniform(0,1.5)\)    | sigma ~ dunif(0, 1.5)     |


And if we want to impress (or scare off) a colleague, we can write down the full model:

\[ P(\mu,\sigma|w) = \frac{\prod_i Normal(w_i|\mu,\sigma)\, Normal(\mu|3.5,1.5)\, Uniform(\sigma|0,1.5)}{\int\int \prod_i Normal(w_i|\mu,\sigma)\, Normal(\mu|3.5,1.5)\, Uniform(\sigma|0,1.5)\, d\mu\, d\sigma} \]

This looks scarier than it is. The numerator just means to calculate the following for each combination of \(\mu\) and \(\sigma\):

  1. for each participant \(i\), calculate the likelihood \(Normal(w_i|\mu,\sigma)\)
  2. take the product of all likelihoods from step 1
  3. multiply this product by the priors \(Normal(\mu|3.5,1.5)\) and \(Uniform(\sigma|0,1.5)\)

This is not how the calculation is really performed (except when one uses grid approximation). Instead, methods like Laplace approximation or MCMC are used to approximate this quantity.
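For small data sets the grid approximation can be written out directly. A sketch with simulated weights (the data here are made up; the priors are those from the table above):

```r
set.seed(1)
w = rnorm(20, 3.5, 0.7)                  # simulated birth weights
mu.grid    = seq(2, 5, length.out = 100)
sigma.grid = seq(0.1, 1.5, length.out = 100)
grid = expand.grid(mu = mu.grid, sigma = sigma.grid)
# log-likelihood of all weights for each (mu, sigma) combination
log.lik = sapply(1:nrow(grid), function(k)
  sum(dnorm(w, grid$mu[k], grid$sigma[k], log = TRUE)))
# add the log-priors, exponentiate, and normalize (the "evidence" step)
unnorm.post = exp(log.lik +
                  dnorm(grid$mu, 3.5, 1.5, log = TRUE) +
                  dunif(grid$sigma, 0, 1.5, log = TRUE))
posterior = unnorm.post / sum(unnorm.post)
sum(posterior)  # 1: dividing by the evidence makes the posterior sum to 1
```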

The denominator is what is called the evidence, and its main purpose is to ensure that the posterior sums to 1.

Prior predictive check

It is good practice to check if the model with its priors makes sensible predictions before one estimates the model parameters. The goal is not to prior-predict data that are very similar to the observed data, but to prior-predict data that pass plausibility checks and are in the same ballpark as the observed data. Plausibility checks are simple things, like that there are no negative weights. With "in the same ballpark" one means that the values should have approximately the same order of magnitude. For instance, newborns weigh a few kilograms, not tens or hundreds of kilograms. Prior predictive checks rely on domain knowledge, and can be very useful in understanding the effect of multivariate priors. We still give it a quick try for our simple example:

prior.predictive.weights = 
  rnorm(
    10000,
    rnorm(10000, 3.5, 1.5),  # draw mu from its prior
    runif(10000, 0, 1.5))    # draw sigma from its prior

hist(prior.predictive.weights,
     main = "", xlab = "prior predictive weights")
text(-2,2000,expression(mu~"="~Normal(3.5,1.5)))
text(-2,1850,expression(sigma~"="~Uniform(0,1.5)))

This looks reasonable, even though we surely know that there are no newborns with a weight below 0 or above 6 kilograms. But the point of the prior is not to forbid impossible values. Such impossible values should be rare, and if they have to be allowed in order to ensure a weak influence of the prior on the results, this is OK.

How does it look if we use non-informative priors?
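The text does not specify the non-informative priors, so the concrete values below (\(\mu \sim Normal(0, 100)\), \(\sigma \sim Uniform(0, 50)\)) are an assumption for illustration:

```r
# Prior predictive check with (nearly) non-informative priors.
# The prior values are an assumption for illustration only.
set.seed(1)
flat.prior.predictive.weights =
  rnorm(10000,
        rnorm(10000, 0, 100),   # very wide prior on mu
        runif(10000, 0, 50))    # very wide prior on sigma
hist(flat.prior.predictive.weights,
     main = "", xlab = "prior predictive weights")
# Most predicted weights are far outside any plausible range for newborns.
```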

It seems obvious that the first set of priors makes more sense.

Estimating the model parameters

To analyse the model, we can just use the data.frame we created above:

head(dt)
##     length   weight
## 1 49.62242 3.518569
## 2 57.07637 4.320868
## 3 54.32599 3.364639
## 4 46.48027 2.184665
## 5 48.24516 2.644656
## 6 50.33399 3.641923

Based on the quap code in the table above, we can now write down the quap model. alist is a command that creates a list that holds our model without evaluating its elements.

alpha.model.list =
  alist(
    weight ~ dnorm(mu,sigma),
    mu <- alpha,
    alpha  ~ dnorm(3.5,1.5),
    sigma  ~ dunif(0,1.5)
  )

For reasons that will become clear soon, we are using a parameter \(\alpha\) to determine the mean \(\mu\).

This is what the model list looks like:

alpha.model.list
## [[1]]
## weight ~ dnorm(mu, sigma)
## 
## [[2]]
## mu <- alpha
## 
## [[3]]
## alpha ~ dnorm(3.5, 1.5)
## 
## [[4]]
## sigma ~ dunif(0, 1.5)

Now we can use quap to calculate the posterior distribution.

alpha.model.fit = quap(alpha.model.list, data=dt)

And we can have a first glimpse of the results with the precis function from the rethinking package:

precis(alpha.model.fit)
##            mean         sd     5.5%     94.5%
## alpha 3.5523808 0.04524070 3.480077 3.6246842
## sigma 0.7156439 0.03200734 0.664490 0.7667978

Posterior predictive check

Before we start interpreting our data, it is always a good idea to see if the model is any good at describing the data.

To do this, we first extract samples from the posterior:

alpha.model.post = extract.samples(alpha.model.fit,n=1e4)
alpha.model.post$mu = alpha.model.post$alpha
head(alpha.model.post)
##      alpha     sigma       mu
## 1 3.605764 0.7428776 3.605764
## 2 3.626048 0.7147510 3.626048
## 3 3.556837 0.7504723 3.556837
## 4 3.511052 0.6793289 3.511052
## 5 3.683530 0.7295955 3.683530
## 6 3.553346 0.7010214 3.553346

One thing we would like to check is if the observed mean is within the credible interval of the posterior for \(\mu\):

hist(alpha.model.post$mu, main = "",xlab = expression(mu))
abline(v = HPDI(alpha.model.post$mu), col = "blue")
abline(v = mean(dt$weight), lwd = 2)

We also expect that most of the data lies within the posterior predicted values. First we calculate the posterior predictions, which depend on \(\mu\) and \(\sigma\):

posterior.predictions = 
  rnorm(nrow(alpha.model.post),
        alpha.model.post$mu,
        alpha.model.post$sigma)

hist(dt$weight, col = "gray10", main = "", xlab = "weight")
hpdi = HPDI(posterior.predictions)
abline(v = hpdi, col = "blue")
in.hpdi = mean(dt$weight > hpdi[1] & dt$weight < hpdi[2])

title(paste0(round(in.hpdi*100),"% of weights are in the 89% HPDI"))

Now let’s simulate 250 weights and see if they are associated with length:

par(mfrow = c(1,2))
plot(dt,
     ylab = "weight",
     xlab = "length")

pp.weight = rnorm(250,
      alpha.model.post$mu[1:250],
      alpha.model.post$sigma[1:250])
plot(dt$length,pp.weight,
     ylab = "posterior prediction weight",
     xlab = "length",
     col = "blue")

Linear regression

Unlike the observed data, the predicted data do not show an association between length and weight. This is obviously because we did not use length to predict weight. Instead of the previous model, where each individual's weight depends only on the group mean, we want a model in which individual weights also depend on length.

par(mfrow = c(1,2))
ylim = range(dt$weight)
plot(dt,
     ylab = "weight",
     xlab = "length",
     ylim = ylim)
abline(h = sample(alpha.model.post$mu,100), col = adjustcolor("blue",alpha = .2))
xs = runif(100,39.75,40.25)
arrows(xs, rep(0,100), xs,
       sample(alpha.model.post$mu,100), col = adjustcolor("red",alpha = .1),
       code = 3)

plot(dt,
     ylab = "weight",
     xlab = "length",
     ylim = ylim)
abline(lm(dt$weight~dt$length), lty = 3, col = "blue", lwd = 3)

Looking at the right-hand side of the plot, we see that it is not so easy to think about what the mean weight is, because length has an inconvenient scale. We can simply re-scale it.

dt$length.s = scale(dt$length)
plot(dt$length.s,
     dt$weight,
     ylab = "weight",
     xlab = "scaled length",
     ylim = ylim)

Now it is easier to see that when the scaled length = 0 (the average length) the weight should be around 3.5 kg, and that when the scaled length goes up by 5 units (−3 to 2), the weight goes up by about 4 units (2 to 6), that is, about 0.8 kg per unit of scaled length.
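As a quick reminder of what scale() does: it subtracts the mean and divides by the standard deviation, so the scaled variable has mean 0 and standard deviation 1. A check with toy numbers:

```r
x = c(46, 48, 50, 52, 54)  # toy lengths in cm
x.s = scale(x)             # (x - mean(x)) / sd(x)
round(mean(x.s), 10)       # 0
sd(x.s)                    # 1
```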

The model

Next we “build” the new model, which can be visualized as follows:

Here is the full model


| What          | Notation                          | quap R-code                  |
|---------------|-----------------------------------|------------------------------|
| Likelihood    | \(w_i \sim Normal(\mu,\sigma)\)   | weight ~ dnorm(mu, sigma)    |
| Trans. paras. | \(\mu_i = \alpha + \beta l_i\)    | mu[i] <- a + b * length.s[i] |
| Prior         | \(\alpha \sim Normal(3.5,1.5)\)   | a ~ dnorm(3.5, 1.5)          |
| Prior         | \(\beta \sim Normal(1,1)\)        | b ~ dnorm(1, 1)              |
| Prior         | \(\sigma \sim Uniform(0,1.5)\)    | sigma ~ dunif(0, 1.5)        |


The only real differences from the previous model are in lines 2 and 4.

Prior predictive check

We do again a prior predictive check to see if model and prior are broadly in line with the data.

plot(0,
     ylab = "weight",
     xlab = "scaled length",
     ylim = c(1,7),
     xlim = c(-3,3))
clr = adjustcolor("black",alpha = .25)
for (k in 1:50)
  abline(rnorm(1,3.5,1.5), rnorm(1,1,1), col = clr)

This looks pretty wild. It is implausible that we have a negative association between length and weight, and the mean weight covers an implausibly large range. We can do better than that without letting the priors determine the fitting results.

plot(0,
     ylab = "weight",
     xlab = "scaled length",
     ylim = c(1,7),
     xlim = c(-3,3))
for (k in 1:50)
  abline(rnorm(1,3.5,1), rlnorm(1,-.25,.5), col = clr)

This looks much more reasonable. Here is the model with new priors:

| What          | Notation                            | quap R-code                  |
|---------------|-------------------------------------|------------------------------|
| Likelihood    | \(w_i \sim Normal(\mu,\sigma)\)     | weight ~ dnorm(mu, sigma)    |
| Trans. paras. | \(\mu_i = \alpha + \beta l_i\)      | mu[i] <- a + b * length.s[i] |
| Prior         | \(\alpha \sim Normal(3.5, 1)\)      | a ~ dnorm(3.5, 1)            |
| Prior         | \(\beta \sim logNormal(-.25, .5)\)  | b ~ dlnorm(-.25, .5)         |
| Prior         | \(\sigma \sim Uniform(0,1.5)\)      | sigma ~ dunif(0, 1.5)        |

And here is the quap model:

mu.model.list =
  alist(
    weight ~ dnorm(mu,sigma),
    mu <- a + b * length.s,
    a  ~ dnorm(3.5,1),
    b  ~ dlnorm(-.25,.5),
    sigma  ~ dunif(0,1.5)
  )

Estimating the model parameters

We fit the quap model.

mu.model.fit = quap(mu.model.list, data=dt)
precis(mu.model.fit)
##            mean         sd      5.5%     94.5%
## a     3.5523821 0.03112969 3.5026309 3.6021334
## b     0.5213863 0.03095314 0.4719172 0.5708553
## sigma 0.4924423 0.02202246 0.4572462 0.5276385

Posterior predictive check

As should by now be customary, we check if our model describes the data reasonably well. We first extract the posterior samples.

mu.model.post = extract.samples(mu.model.fit)

Next we calculate the “transformed parameter” \(\mu\) for each participant. Because we have 10000 posterior samples and 250 participants, we end up with a 10000 × 250 matrix of posterior predictions.

n_samples = nrow(mu.model.post)
mu.model.pp = 
  apply(dt, # instead of a for loop
        1,  # for each row in dt
        function(x){
          length.s = x[3] # take the scaled length from pos. 3
          with(data.frame(mu.model.post),
               {
                 mu = a + b * length.s           # calculate mu
                 pp = rnorm(n_samples,mu,sigma)  # generate post. preds
                 return(pp)
               }
          )
        })
dim(mu.model.pp)
## [1] 10000   250

And we check if the predicted weights are consistent with the observed data.

Posterior predictive checks do not need to be limited to simply the outcome variable. We can check any aspect of the data we care about. Here we check if the mean and sd of the observed data lie within the credible intervals of the posterior predictions.

par(mfrow = c(1,2))

pp.mu = apply(mu.model.pp, 1, mean)
pp.sd = apply(mu.model.pp, 1 , sd)

hist(pp.mu, main = "")
abline(v = c(HPDI(pp.mu),mean(dt$weight)), 
       col = c("blue","blue","black"))

hist(pp.sd, main = "")
abline(v = c(HPDI(pp.sd),sd(dt$weight)), 
       col = c("blue","blue","black"))

We can also check if we see the covariation of birth length and birth weight if we use predicted weights.
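A sketch of such a check. The stand-in data below are simulated from the posterior means reported by precis above; in the actual analysis one would use dt$length.s and a row of mu.model.pp:

```r
# Stand-in covariation check (the parameter values are the posterior means
# from precis above; real code would use dt and mu.model.pp).
set.seed(1)
length.s = rnorm(250)                  # stand-in scaled lengths
a = 3.55; b = 0.52; sigma = 0.49       # posterior means from precis
pp.weight = rnorm(250, a + b * length.s, sigma)
plot(length.s, pp.weight,
     xlab = "scaled length", ylab = "posterior predicted weight",
     col = "blue")
cor(length.s, pp.weight)  # clearly positive, unlike the intercept-only model
```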

Describe the results

Now that we have convinced ourselves that the model describes the data well enough (yes, this is to a degree subjective), we can present our results.

As a visual presentation we can e.g. just show the slope:

plot(dt$length.s,
     dt$weight,
     col = "grey", 
     ylab = "weight",
     xlab = "length",
     xaxt = "n")
x.ticks = seq(35,60,by = 5)
axis(1, 
     at = (x.ticks-mean(dt$length))/sd(dt$length), 
     labels = x.ticks)
for(k in 1:25) 
  abline(a = mu.model.post$a[k], 
         b = mu.model.post$b[k],
         col = adjustcolor("black",alpha = .2))

Splines


  1. Who certainly was clever, but is nowadays also infamous for his views on eugenics and race.↩︎